Abstract

We present a base-pairing model of oligonucleotide duplex formation and show in detail its
equivalence to the Nearest-Neighbour dimer methods from fits to free energy of duplex formation
data for short DNA-DNA duplexes and DNA-RNA hybrids containing only Watson-Crick pairs. In this
approach the connection between rank-deficient polymer and rank-determinant oligonucleotide
parameter sets for DNA duplexes is transparent. The method is generalised to include RNA/DNA
hybrids, where the rank-deficient model with 11 dimer parameters in fact provides marginally
improved predictions relative to the standard method with 16 independent dimer parameters
($\Delta G$ mean errors of 4.5 and 5.4 % respectively).
1 Introduction



Simple nearest-neighbour (NN) models of helix stability at the base dimer level have been 
refined over some years CP-|1] and are commonly used, for example, to predict RNA secondary 
structure formation [3], [H] and duplex melting profiles (for a recent example see [7]). Detailed
knowledge of competing RNA/RNA and RNA/DNA bindings is also becoming desirable for 
cDNA microarrays [8].

Previously we [HI suggested that a deeper systematic framework lay beneath the semi-empirical
rules of the NN approach, while in Ref. [10] we proposed such a scheme and demonstrated
its equivalence to another [2] model of oligonucleotide RNA duplexes containing WC
pairs.

The structure of the paper is as follows. After a brief outline of the NN dimer method we
refine our base-pairing model proposed in Ref. [10]. The equivalence between the two approaches
is then described in detail. Finally two-dimensional models are fitted to empirical $\Delta G$ values
for DNA and RNA/DNA hybrid duplexes and compared statistically to their dimer analogues.

2 Nearest Neighbour model 

The Nearest-Neighbour (NN) model of duplex formation (see, e.g., Ref. [2]) is based upon the
assumption that once an initiation barrier, preventing the formation of a single H-bonded pair, 
is overcome the duplex/single strand interface propagates along the strands "zipping" the two 
strands up into a duplex. The critical assumptions are: 

1) The process occurs at fixed strand concentrations. 

2) Derivation of thermodynamical parameters from melting curves is for an assumed equi- 
librium between two (helix and coil) possible states. 

3) The thermodynamic properties of the duplex depend linearly upon the frequencies of
adjacent pairs of bases (dimers). In particular the formula used to estimate duplex free energy
of formation is

$$\Delta G = \Delta G_{init} + \sum_{WY/XZ} N_{WY/XZ}\, \Delta G_{WY/XZ} + \Delta G_{sym}. \qquad (1)$$

Here $\Delta G_{init}$ is a free energy initiation step including translational and rotational entropy loss. It
is, in principle, both (duplex) length- and sequence-dependent but for short duplexes is typically
assumed to depend only upon the identity of the terminal base pairs. $N_{WY/XZ}$ denotes the
frequency with which dimer 5'WY3'/3'XZ5' occurs in the duplex while $\Delta G_{WY/XZ}$ is its free energy
contribution to the helix formation. The latter is interpreted as the dimer propagation energy
associated with the zippering of strands. Dominant contributions to this propagation energy are
the van der Waals "stacking" between adjacent base pairs and specificity-conferring H-bonding
of complementary base pairs. Finally an extra entropy term is added for self-complementary
duplexes (i.e. helices formed from strands with identical sequences) due to the extra twofold
symmetry. Theoretically $\Delta G_{sym} = RT \ln 2 \approx 0.43$ kcal mol$^{-1}$ at physiological temperature
$T = 310$ K.
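As a rough illustration of the bookkeeping in Eq. (1), the sketch below sums dimer propagation energies along one strand and evaluates the symmetry term $RT \ln 2$. The dimer values are the NN column of Table 2 of this paper; the function names and the gas-constant value are assumptions introduced here, and the initiation term is deliberately left out because its NN values are not listed in the text.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal mol^-1 K^-1 (assumed value)

# NN column of Table 2 (kcal/mol), keyed by the 5'->3' dimer on one strand.
NN_DG = {"GG": -1.77, "GC": -2.28, "CG": -2.09, "AA": -1.02, "AT": -0.73,
         "TA": -0.60, "GA": -1.46, "GT": -1.43, "AG": -1.16, "TG": -1.38}
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def canonical(dimer):
    """Return the Table 2 key for a dimer or for its symmetric partner."""
    if dimer in NN_DG:
        return dimer
    # Symmetric dimer approximation: read the dimer on the opposite strand.
    return "".join(COMPLEMENT[base] for base in reversed(dimer))


def propagation_energy(strand):
    """Sum of dimer propagation energies for a WC duplex given one strand."""
    return sum(NN_DG[canonical(strand[i:i + 2])]
               for i in range(len(strand) - 1))


def symmetry_term(temperature_k=310.0):
    """Entropy penalty RT ln 2 for self-complementary duplexes."""
    return R_KCAL * temperature_k * math.log(2.0)


if __name__ == "__main__":
    print(round(propagation_energy("GCGC"), 2))
    print(round(symmetry_term(), 2))
```

A full estimate would add the initiation term and, for self-complementary strands only, the symmetry term to the propagation sum.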

2.1 Model parameters 

In the simplest instance one assumes that the dimer propagation energy is independent of
"zippering" direction, that is, $\Delta G_{WY/XZ} = \Delta G_{ZX/YW}$ and therefore $N_{WY/XZ} = N_{ZX/YW}$. It
is easy to check that if p types of pair are possible then this symmetry reduces the number of
types of dimer from $(2p)^2$ to $p(2p + 1)$. We shall refer to this below as the "symmetric" dimer
approximation.



For oligomers, i.e., helices with formally defined ends, there are also 2p possible terminal
parameters. In this case the hypothetical 5'-3' dimer symmetry must be compatible with the
global 5'-3' symmetry of the full duplex, leading to p constraints upon the numbers of
independent dimer ($N_{WY/XZ}$) and terminal ($N_{5'P/3'Q}$) frequencies which coexist in the model.
For convenience p of the terminal parameters are typically eliminated, e.g., $N_{5'P/3'Q} = N_{5'Q/3'P}$.
The maximal number of independent NN frequencies in the model is therefore

u = p + p(2p + 1).
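The counting above can be checked by brute force for Watson-Crick pairs (p = 2). The sketch below groups each 5'->3' dimer with the dimer read on the opposite strand and counts the resulting classes; the helper name is hypothetical and the enumeration is only an illustration of the argument, not part of the fitting procedure.

```python
from itertools import product


def symmetric_dimer_classes(complement):
    """Group each 5'->3' dimer with its partner read on the opposite strand."""
    classes = set()
    for first, second in product(complement, repeat=2):
        dimer = first + second
        partner = complement[second] + complement[first]
        # Use the alphabetically smaller member as the class label.
        classes.add(min(dimer, partner))
    return classes


WC = {"A": "T", "T": "A", "G": "C", "C": "G"}
p = len(WC) // 2                    # number of pair types (G.C and A.T)
classes = symmetric_dimer_classes(WC)
n_asymmetric = (2 * p) ** 2         # all oriented dimers
n_symmetric = p * (2 * p + 1)       # classes under the 5'-3' symmetry
max_independent = p + n_symmetric   # dimer plus terminal frequencies

if __name__ == "__main__":
    print(sorted(classes))
    print(n_asymmetric, n_symmetric, max_independent)
```

The four self-partnered dimers (GC, CG, AT, TA) form classes of size one, which is why sixteen oriented dimers collapse to ten classes rather than eight.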

Oligomer models derived for RNA [2] and DNA duplexes 0, [12] containing Watson-Crick (WC)
pairs (p = 2) exhibit good agreement with experimental data for short, "two state" duplexes
while accuracy is decreased for duplexes containing mismatches (Refs. [13]-[17] for DNA or
see [18]-[20] for RNA). This decrease is generally attributed to the emergence of non-nearest
neighbour effects such as pairing geometries with reduced complementarity. The "asymmetric
dimer" approximation, where the zipping direction is distinguished, has $(2p)^2$ dimer parameters
and may be applied to hybrid RNA/DNA duplexes or, possibly, as a more sophisticated model
of ordinary (RNA/RNA or DNA/DNA) homoduplexes. In fact the asymmetric dimer model is
required to obtain reasonable predictions for RNA/DNA hybrids (^U, H^, ^) but provides
only marginal improvements in predictions for homoduplexes [2], |inj .

It is also known [22] for WC-paired DNA polymers that a model consisting of eight "invariants",
or linear combinations of dimer steps, provides predictions of comparable accuracy to the
model including parameters for all ten dimers. Previously we suggested that the base-pairing
model developed in Ref. [10] for RNA duplex formation is more naturally compared with the
NN method at this independent, short sequence (ISS) level, which we elaborate upon below.

3 Base-pairing model 

Our model is motivated by the problem of including sequence-dependence in dynamical,
base-pair level descriptions of DNA. It follows from the NN assumption (3) above: if a physical
property is a linear function of the duplex NN dimer quantities, then there must be an equivalent
expression which is quadratic in labels associated with individual bases.

For the instance of thermodynamical quantities this quadratic description corresponds to a 
systematic summation of two-body correlations. Given the pre-eminent role of van der Waals 
"stacking" and H-bonding interactions in helix stabilisation, these correlations might be repre- 
sentative of the interactions between, for example, amino/keto functional groups and heterocyclic 
rings. Moreover terms which are independent of sequence content variations, depending only 
upon overall sequence length have a natural interpretation as contributions to the generic (B 
form, in the present case) helical backbone structure. 

Consider the dimer propagation energy parameters $\Delta G_{WY/XZ}$ associated with the "zipping
up" of dimers 5'WY3'/3'XZ5'. We shall assume that they may be decomposed quadratically as

$$\Delta G_{WY/XZ} = (X^T, Z^T) \begin{pmatrix} P_{11} & P_{12} \\ P_{21} & P_{22} \end{pmatrix} \begin{pmatrix} W \\ Y \end{pmatrix}. \qquad (2)$$

Here the matrix entries $P_{ij}$ represent generic correlations between the various sites of the dimer,
while the entries W, X, Y, Z are vectors encapsulating the sequence variation. For lack
of better nomenclature we shall refer to matrix elements as "correlations" and the vectors as
"weights". Three observations are in order:

1) The diagonal entries $P_{ii}$ (we shall call them $h_i$) are associated with bases which are
H-bonded while off-diagonal ones are associated with "stacked" neighbours (which we call $s = P_{12}$
and $t = P_{21}$).

2) W, X, Y, Z are vectors in some abstract "weight" space. In order to reproduce the
correct number of matrix elements, the dimensionality of this space must coincide [10] with
the number of permissible base-pair types, p, above. The nature of this weight space is discussed in
a later section.
a later section. 

3) Given the H-bonding specificity between complementary pairs, the inter- and intra-strand 
correlations are not independent of one another. The stacking correlations are thus understood 
to contain contributions from both these types of interaction. 

Therefore expression (2) can be rewritten in the form used previously [10],

$$\Delta G_{dimer} = \sum_{i=1}^{2} y_i^T h_i x_i + y_1^T s\, x_2 + y_2^T t\, x_1, \qquad (3)$$

where bases $x_i$ and $y_i$ refer to bases on the 5'-3' and 3'-5' oriented strands respectively and
the Roman index i refers to the location of the base within the dimer. We shall also adopt the
convention that Greek indices signify internal "weight space" degrees of freedom.

The distinction between symmetric and asymmetric dimer approximations is quantified by
the existence of a 5'-3' symmetry transformation $\tau_2$. In the spatial point group of a dinucleotide
dimer this transformation is just a $C_2$ rotation about an axis perpendicular to the average plane
of the (dimer) molecule passing through the centre of mass. Given that this model is concerned
with correlations between abstract labels in a multi-dimensional vector space and not a priori
in physical 3-D space we can only require that the transformation be involutive, i.e.,

$$(\tau_2)^2 = I_p, \qquad (4)$$

where $I_p$ denotes the $p \times p$ identity matrix. In order for expression (3) to be manifestly 5'-3'
symmetric it must be the case that:

$$h_1 - \tau_2^T h_2^T \tau_2 = 0, \qquad s - \tau_2^T s^T \tau_2 = 0, \qquad t - \tau_2^T t^T \tau_2 = 0. \qquad (5)$$

For any suitable (involutive) choice of $\tau_2$ it is readily found that the number of independent h, s
and t matrix elements is $p(2p + 1)$, in agreement with the number of symmetric dimer parameters.
Without loss of generality, in the remainder of the paper we shall therefore assume $\tau_2 = -I_p$.

3.1 Duplex model

We now construct the correlations for a full duplex of n base pairs $x_1 \ldots x_n$, $y_1 \ldots y_n$. As in the
NN approach the duplex heat of formation is approximated by a summation of internal dimer
propagation energies plus, for oligomers, terminal-environmental effects:

$$\Delta G(X, Y) = \sum_{i=1}^{n-1} \Delta G_{dimer}(x_i, x_{i+1}; y_i, y_{i+1}) + \Delta G_{term/env}. \qquad (6)$$

Here, c.f. Eq. (1), the summation is over individual sites within the duplex rather than dimer
occurrence frequencies. Let us denote the duplex analogues of dimer submatrices h, s and t by
capital letters. If the global 5'-3' transformation is written $\tau_n$ the global symmetry constraints
are

$$H_i - \tau_n^T H_{\bar{\imath}}^T \tau_n = 0, \qquad (7)$$

$$S_{ij} - \tau_n^T S_{\bar{\jmath}\bar{\imath}}^T \tau_n = 0; \quad j > i, \qquad (8)$$

$$T_{ij} - \tau_n^T T_{\bar{\jmath}\bar{\imath}}^T \tau_n = 0; \quad j < i, \qquad (9)$$

with i and j running from 1 to n and where we have defined the 5'-3' reflected index
$\bar{\imath} = n + 1 - i$. Naturally in the NN approximation only those submatrices with indices $|i - j| \le 1$
contain nonzero elements. Reconciling the global and dimer symmetries is therefore equivalent
to substituting the symmetric dimer version of (3) in (6). In particular one finds [10]

$$H_1 = h_1 = H_n^T, \qquad H \equiv H_2 = h_1 + h_1^T = H_3 = \ldots = H_{n-1}, \qquad (10)$$

$$S_{i,i+1} = s, \qquad T_{i+1,i} = t. \qquad (11)$$

Note that for internal bases the H-bonding correlations effectively contribute twice and in such a
way that $H_i = H_i^T$ for 1 < i < n. This is desirable due to the fact that constraint (7) on its own
would imply different symmetry properties for the central base pair(s) in odd- (even-) length
duplexes. We note that this double counting of internal base pairs with respect to terminals
may also be justified in terms of the relative propensities for fraying of the latter [23].

Additional, environmental, perturbations to terminal bases may be formally incorporated by
augmenting the original sequences with fictitious "end neighbour" vectors [24]

$$(x_1, \ldots, x_n) \to (e_x, x_1, \ldots, x_n, e'_x),$$

$$(y_1, \ldots, y_n) \to (e_y, y_1, \ldots, y_n, e'_y).$$

If these perturbations are themselves 5'-3' symmetric then the environmental effects may be
incorporated as a linear term

$$\Delta G_{term/env} = \beta \cdot (x_1 + x_n + y_1 + y_n), \qquad (12)$$

where $\beta$ is a constant, p-dimensional vector which is not required, i.e., vanishes, in expressions
for polymers or circular duplexes, where no ends are formally defined.

3.2 Number of Model Parameters 

Let us now recall the observation [22] that there is a more fundamental description of WC-paired
polymers in terms of eight "invariants" and that moreover, when initiation terms are added, the 
predictions for oligomers are comparable to those of the full model with ten dimer parameters. 

Observe that bases yi and Xi have, to this point, been effectively treated as independent 
variables, but as is well known, the complementary H-bonding of nucleotide bases is highly 
specific. Thus we can assume that if the types of base-pairing occurring within the duplex are
known then a relationship between complementary bases might be exploited. For example, if 
only WC pairing geometries occur in a given duplex then it follows that if site yi contains the 
base C, site Xi must contain a G. Thus the labels specifying one of the two bases are redundant. 

In general for a base pair we may therefore write

$$y_i = \gamma_i x_i = \lambda_i R_i x_i, \qquad (13)$$

where $\lambda_i > 0$ is a scaling factor while $R_i$ is, in general, some length-preserving transformation
and the matrices $\gamma_i$ are assumed to be characteristic of a particular pairing geometry. It follows
from the expression

$$y_i^T h x_i = \lambda_i x_i^T R_i^T h x_i, \qquad (14)$$

that the off-diagonal entries of $(R^T h)$ only occur in linear combinations, thereby reducing the
number of independent dimer parameters by $p(p - 1)/2$.

In the special case of WC-pairing geometries (p = 2) we therefore have 9 such parameters.
To obtain a model with eight parameters we require a further assumption about the form of $R_i$.
From (10) the internal H-bonding correlations are symmetric and it is easy to see that the four
elements $H^{\alpha\beta}$ occur in just two linear combinations precisely when R has zero values along its
diagonal. This effective eight-parameter treatment of internal base pairs is directly comparable
to the description of polymers, i.e. sequences with no formal ends, in terms of eight "invariants"
or ISS.

Now consider the difference between the contributions of 5'P/3'Q and 5'Q/3'P pairs. If the
pairs are internal, since $H = H^T$ the H-bonding enthalpies are equal:

$$q^T H p = p^T H q.$$

It follows that any orientation effects are manifest only at the terminals where

$$q^T h p = p^T h^T q \neq p^T h q.$$

Thus in the WC case the difference $h_{12} - h_{21}$ is a measure of the magnitude of these effects.
If, as is commonly assumed in the NN approach for DNA and RNA duplexes, 5'G/3'C and
5'C/3'G terminal pairs are equivalent (and similarly for A.T/U) this quantity should be a small
correction. Therefore the model with eight dimer parameters obtained by assuming $h_{12} \approx h_{21}$
should be a reasonable approximation of the full, 9 parameter version, which we verify in the
results below. Analogously in Ref. [25] the ability of the "rank deficient" DNA polymer model
to reproduce DNA oligomer data was rationalised thus: "...most of the sequence dependence of
oligonucleotide DNA thermodynamics is captured in the first eight terms and the remaining two
are small perturbations..."

In two dimensions there are just two candidates for the transformations $R_i$ defined in (14);
these are proportional to

$$i\sigma_1 = \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix}, \qquad \sigma_2 = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}.$$

If, for simplicity, the scalings $\lambda_i = 1$, the latter form can be rejected since $\sigma_2 = \sigma_2^T$. This would
ensure that the six stackings $s^{\alpha\beta}$, $t^{\alpha\beta}$ always appear in just three linear combinations,
reducing the number of independent parameters from 8 to 5. For $\gamma = \pm i\sigma_1$ the property $\gamma^T = -\gamma$
means that the relative signs between s and t elements are sequence-dependent, thereby ensuring
the number of model parameters is fixed at eight.

3.3 Equivalence with dimer model 

For clarity we now show the equivalence of the present model with the NN approach for canonical
duplexes. Firstly the duplex pairing matrix is written in terms of hydrogen-bonding and stacking
components:

$$\Delta G(X, Y) = \Delta G_{HB} + \Delta G_{ST}, \qquad (15)$$

$$\Delta G_{HB} = y_1^T h x_1 + y_n^T h^T x_n + \sum_{i=2}^{n-1} y_i^T H x_i, \qquad (16)$$

$$\Delta G_{ST} = \sum_{i=1}^{n-1} \left( y_i^T s\, x_{i+1} + y_{i+1}^T t\, x_i \right). \qquad (17)$$

Now consider a dimer 5'XZ3'/3'WY5' occurring with frequency $N_{XZ}$ in a given duplex. For
symmetric dimers $N_{XZ} = N_{YW}$ and the stacking term $\Delta G_{ST}$ depends on ten linear combinations
of these 16 frequencies:

$$\Delta G_{ST} = 2N_{GC} S_{GC/GC} + 2N_{CG} S_{CG/CG} + 2N_{AT} S_{AT/AT} + 2N_{TA} S_{TA/TA}$$
$$\quad + (N_{GA} + N_{TC}) S_{GA/TC} + (N_{GT} + N_{AC}) S_{GT/AC} + (N_{CA} + N_{TG}) S_{CA/TG}$$
$$\quad + (N_{AG} + N_{CT}) S_{AG/CT} + (N_{AA} + N_{TT}) S_{AA/TT} + (N_{GG} + N_{CC}) S_{GG/CC}. \qquad (18)$$

Similarly, if $h = H/2$, the H-bonding terms contain four numbers:

$$\Delta G_{HB} = (n^t_{GC} + 2n^i_{GC}) h_{GC} + (n^t_{AT} + 2n^i_{AT}) h_{AT}; \qquad H_{xy} = \sum_{\alpha,\beta=1}^{p} h^{\alpha\beta} y_\alpha x_\beta, \qquad (19)$$

where $n^t_{GC}$, $n^i_{GC}$, $n^t_{AT}$, $n^i_{AT}$ denote the numbers of terminal and internal G.C and A.T pairs
respectively. These numbers of base pairs are not independent of the stacking frequencies $N_{XY}$,
however:

$$(n^t_{GC} + 2n^i_{GC}) = N_{GA} + N_{GT} + 2N_{GG} + 2N_{GC} + N_{AG} + N_{TG} + 2N_{CG},$$
$$(n^t_{AT} + 2n^i_{AT}) = 2N_{AA} + 2N_{AT} + N_{AG} + N_{AC} + N_{GA} + 2N_{TA} + N_{CA}.$$

Combining these identities in (19) with (18) one sees that the coefficients of $N_{XY}$ in (15) contain
both H-bonding and stacking correlations. In this way one obtains ten dimer parameters
equivalent to those of the Individual Nearest Neighbour with H-Bonding (INN-HB) model [2] (see
Table 2 in the results for verification of this). Furthermore the ten dimer parameters in our model
are just linear combinations of the eight matrix elements, similar to the eight ISS parameters
discussed in Ref.
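The link between base-pair counts and dimer frequencies used above rests on a simple observation: a terminal pair belongs to one dimer and an internal pair to two. The sketch below checks this numerically for one strand of a WC duplex; the function names are hypothetical and strands are assumed to contain only A, C, G, T and at least two bases.

```python
GC_BASES = set("GC")


def weighted_pair_counts(strand):
    """Return (n_t + 2 n_i) for G.C pairs and for A.T pairs."""
    last = len(strand) - 1
    gc = at = 0
    for i, base in enumerate(strand):
        weight = 1 if i in (0, last) else 2  # terminal pairs count once
        if base in GC_BASES:
            gc += weight
        else:
            at += weight
    return gc, at


def dimer_pair_counts(strand):
    """Count G.C and A.T pairs summed over every dimer step of the duplex."""
    gc = at = 0
    for i in range(len(strand) - 1):
        n_gc = sum(base in GC_BASES for base in strand[i:i + 2])
        gc += n_gc
        at += 2 - n_gc
    return gc, at


if __name__ == "__main__":
    for seq in ("GCATTAGC", "AATT", "GGCACTG"):
        print(seq, weighted_pair_counts(seq), dimer_pair_counts(seq))
```

Both functions return the same pair of numbers for any strand, which is the content of the two identities relating $n^t + 2n^i$ to the dimer frequencies.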

3.4 Weight space 

Having discussed the 2-body correlation matrix in detail we now turn to the vector "weight" 
space. In order to obtain a model which is equivalent to the polymer dimer model we have 
imposed just two "rules" on the weight space: 

(1) The vectors for complementary bases are orthogonal and of equal length. 

(2) The model of duplexes with WC pairing must have p = 2, therefore the vectors for 
G,C,A,T all live in the same (two-dimensional) space. 

Note that there are still 2p unknown parameters (the coordinate values for, say, G and A).
However, rather than attempt to obtain them from a fit to empirical data, assumptions may be
made about what the vector coordinates represent.

For WC pairs in Ref. [10] we assumed one coordinate counted the number of H-bonds formed,
the other indicated whether the heterocyclic ring was purine or pyrimidine. With the basis

$$\{G, C, A, U(T)\} = \{(\sqrt{3}, 1/2),\ (-1/2, \sqrt{3}),\ (-\sqrt{2}, 1/2),\ (-1/2, -\sqrt{2})\}, \qquad (20)$$

we obtained an 8-parameter, so-called "rank-deficient", model of RNA oligonucleotides
statistically identical to the conventional dimer model (e.g. Refs. 0, 0). To obtain fitted models
from DNA/DNA and RNA/DNA data below we shall assume that the same degrees of freedom
(20) contribute to WC pairing. The effects of the differing sugar backbone geometries will be
manifested in the relative magnitudes of the fitted coupling values for the various duplexes.
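The two weight-space rules can be checked directly on the basis of Eq. (20). The coordinates below are a reading of that equation (square roots of 3 and 2 for the three- and two-H-bond pairs) and should be treated as an assumption; the helper names are hypothetical. The sketch also shows that each pyrimidine vector is the quarter-turn rotation of its purine partner, i.e. a transformation with zeros on the diagonal.

```python
import math

# Reading of Eq. (20); the coordinates are an assumption of this sketch.
WEIGHTS = {
    "G": (math.sqrt(3), 0.5),
    "C": (-0.5, math.sqrt(3)),
    "A": (-math.sqrt(2), 0.5),
    "T": (-0.5, -math.sqrt(2)),  # same vector is used for U
}


def dot(u, v):
    """Euclidean inner product in the two-dimensional weight space."""
    return u[0] * v[0] + u[1] * v[1]


def quarter_turn(v):
    """Apply the matrix [[0, -1], [1, 0]] (zero diagonal) to v."""
    return (-v[1], v[0])


if __name__ == "__main__":
    for purine, pyrimidine in (("G", "C"), ("A", "T")):
        x, y = WEIGHTS[purine], WEIGHTS[pyrimidine]
        print(purine, pyrimidine, dot(x, y), dot(x, x), dot(y, y),
              quarter_turn(x) == y)
```

Complementary vectors come out orthogonal and of equal length within each pair, while the G.C vectors are longer than the A.T ones.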

4 Results: WC Pairs in DNA 

The next stage of our comparative analysis is to obtain the DNA pairing model from data for
duplexes with two-state melting transitions and consisting of only Watson-Crick pairs. Following
Ref. [3], we construct models for thermodynamic parameters $\Delta G$, $\Delta H$, $\Delta S$ for two sets of
sequences: one set of 44 duplexes terminated by GC pairs only, the other containing, in addition
to these 44, eight duplexes with at least one AT terminal.

We use an initiation term with separate 5'T/3'A and 3'T/5'A parameters, consistent with
the observation of SantaLucia et al. [3] that the former have a tendency to fray:

$$\Delta G_{init} = a_1 n_{GC} + a_2 n_{5'T} + a_3 n_{5'A}. \qquad (21)$$

The self-complementary entropy "penalty" $\Delta G_{sym}$ is set to the theoretical value 0.43 kcal mol$^{-1}$
at a temperature of 310 K. The form of our fitting function is thus given by (21) plus (16) and
(17):

$$\Delta G(X, Y) = \Delta G_{init} + \Delta G_{sym} + y_1^T h x_1 + y_n^T h^T x_n + \sum_{i=2}^{n-1} y_i^T (h + h^T) x_i + \sum_{i=1}^{n-1} \left( y_i^T s\, x_{i+1} + y_{i+1}^T t\, x_i \right). \qquad (22)$$

Here h, s and t are 2x2 matrices while vectors $x_i$, $y_i$ take the appropriate values from the base
vector set (20). Due to the assumed 5'-3' dimer symmetry s and t are symmetric, while we
have kept the distinction between $h$ and $h^T$ for the purpose of comparing rank-deficient and
rank-determinant parameter sets.

To compare the NN dimer and base-pairing approaches we shall compute the same statistical
parameters used in other studies, for example, Ref. ^2.. In addition to the root of the mean of
squared residuals $\sigma$ we include the unweighted $\chi^2$ parameter, computed as

$$\chi^2 = \sum \left( \frac{G_p - G_o}{\sigma} \right)^2, \qquad (23)$$

where $G_p$, $G_o$ and $\sigma$ are the predicted and observed values of G and the rms value respectively.
Here f is the number of observations less the number of model parameters. The reduced
parameter $\chi^2/f$ should have a value close to unity for a good fit. The Q value estimates the likelihood
of obtaining a particular value of $\chi^2$ by chance:

$$Q = \Gamma(f/2, \chi^2/2) / \Gamma(f/2), \qquad (24)$$

where $\Gamma(a)$ and $\Gamma(a, z)$ denote the complete and incomplete gamma functions respectively. Small
Q values signify that discrepancies between the model predictions and experimental data are
unlikely to be due to chance.
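For readers who wish to recompute these statistics, the following sketch evaluates the rms residual, the $\chi^2$ of Eq. (23) and the Q value of Eq. (24) with the standard library only. The function names are hypothetical, $\sigma$ is passed in explicitly rather than fixed by the code, and Q is obtained from the closed-form recurrence for integer f, which is an implementation choice of this sketch rather than anything stated in the text.

```python
import math


def rms_residual(predicted, observed):
    """Root of the mean of squared residuals."""
    n = len(observed)
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n)


def chi_squared(predicted, observed, sigma):
    """Unweighted chi^2 of Eq. (23) with a single sigma for all points."""
    return sum(((p - o) / sigma) ** 2 for p, o in zip(predicted, observed))


def q_value(dof, chi2):
    """Gamma(dof/2, chi2/2) / Gamma(dof/2) for an integer dof >= 1."""
    x = chi2 / 2.0
    if dof % 2 == 0:
        a, q = 1.0, math.exp(-x)               # Q(1, x)
    else:
        a, q = 0.5, math.erfc(math.sqrt(x))    # Q(1/2, x)
    # Recurrence: Q(a + 1, x) = Q(a, x) + x**a * exp(-x) / Gamma(a + 1).
    while a < dof / 2.0:
        if x > 0.0:
            q += math.exp(a * math.log(x) - x - math.lgamma(a + 1.0))
        a += 1.0
    return q


if __name__ == "__main__":
    print(q_value(40, 52.0))
```

The reduced statistic is then simply `chi_squared(...) / dof`, with `dof` the number of observations minus the number of fitted parameters.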

Using Eqs. (23) and (24) we compute these statistics for the predicted values of $\Delta G$ in six
models in Table 1. The models NN1 and NN2 are respectively the 11- and 12-parameter models
obtained in Ref. [3] for sets of G.C- and G.C/A.T-terminated duplexes. JB1a and JB1b denote,
respectively, the rank-determinant ($h \neq h^T$) and rank-deficient ($h = h^T$) models for the
44 GC-terminated duplexes. Similarly JB2a and JB2b denote the same models for the full set
of 52 sequences. Several observations may be made immediately from Table 1. Firstly, the fit
rms values for all models are in good agreement; however, if the distinction between AT and TA
terminal parameters is removed the rms values of JB2a, JB2b would rise respectively to 0.41 and
0.39 kcal mol$^{-1}$. Note however that introducing separate parameters for 5'G/3'C and 5'C/3'G
in all instances provides slight improvements in $\sigma$ values of < 0.005 kcal mol$^{-1}$ while increasing
$\chi^2/f$ and decreasing Q. Therefore the optimal choice of initiation term (21) is validated.

Model    σ^a     χ²/f    Q
NN1      0.33    1.22    0.10
JB1a     0.33    1.26    0.14
JB1b     0.33    1.22    0.17
NN2      0.35    1.30    0.10
JB2a     0.34    1.28    0.12
JB2b     0.34    1.24    0.14

Table 1: Comparison of statistics for the standard Nearest-Neighbour (NN) parameters and the
fitting of our models to the same data. ^a Units are kcal mol$^{-1}$.

The parameter set for estimating $\Delta G$ for the best model, JB2a, is given by (units are kcal
mol$^{-1}$)

$$a_1 = 0.833, \qquad a_2 = 0.98, \qquad a_3 = 1.84,$$

$$h = \begin{pmatrix} 0.139 & -0.293 \\ -0.305 & -0.139 \end{pmatrix}, \quad s = \begin{pmatrix} -0.071 & -0.033 \\ -0.033 & 0.180 \end{pmatrix}, \quad t = \begin{pmatrix} -0.190 & 0.019 \\ 0.019 & 0.055 \end{pmatrix}. \qquad (25)$$

Of course the models should also be compared at the level of dimer propagation energies. These
quantities are found from (22) via

$$\Delta G_{XZ/YW} = X^T h Y + Z^T h W + Y^T s Z + W^T t X$$

and using the basis (20) and parameters (25). Values are compared to the dimer quantities
of Ref. [3] in Table 2 below. For completeness the fitted matrix elements for $\Delta H$ and $\Delta S$ and
related dimer quantities are included in the appendix.
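The conversion from matrix elements to dimer energies can be sketched as follows. Several things here are assumptions rather than statements of the text: the weight vectors are a reading of Eq. (20), the three matrices are read from Eq. (25) in the order h, s, t, each base pair contributes the symmetrised half $H/2 = (h + h^T)/2$, and the placement of s and t on the two cross-strand steps is one of two possible index conventions. Under these assumptions the sketch gives values within a few hundredths of a kcal mol$^{-1}$ of the BP column of Table 2 for eight of the ten dimers, while the AT/TA and TA/AT rows come out interchanged, so the convention should be checked against the original derivation before any quantitative use.

```python
import math

# Weight vectors: a reading of Eq. (20) (assumption).
BASES = {"G": (math.sqrt(3), 0.5), "C": (-0.5, math.sqrt(3)),
         "A": (-math.sqrt(2), 0.5), "T": (-0.5, -math.sqrt(2))}
PAIR = {"G": "C", "C": "G", "A": "T", "T": "A"}

# Matrices read from Eq. (25) in the order h, s, t (assumption), kcal/mol.
H_TERM = ((0.139, -0.293), (-0.305, -0.139))
S = ((-0.071, -0.033), (-0.033, 0.180))
T = ((-0.190, 0.019), (0.019, 0.055))
# Each pair of a dimer carries half of the internal H = h + h^T.
HALF_H = tuple(tuple((H_TERM[a][b] + H_TERM[b][a]) / 2.0 for b in range(2))
               for a in range(2))


def form(u, matrix, v):
    """Bilinear form u^T M v in the two-dimensional weight space."""
    return sum(u[a] * matrix[a][b] * v[b] for a in range(2) for b in range(2))


def dimer_energy(top):
    """Propagation energy of the dimer whose 5'->3' strand is `top`."""
    x1, x2 = BASES[top[0]], BASES[top[1]]
    y1, y2 = BASES[PAIR[top[0]]], BASES[PAIR[top[1]]]
    h_bonding = form(y1, HALF_H, x1) + form(y2, HALF_H, x2)
    # Cross-strand stacking; which step carries s and which t is assumed.
    stacking = form(y2, S, x1) + form(y1, T, x2)
    return h_bonding + stacking


if __name__ == "__main__":
    for top in ("GG", "GC", "CG", "AA", "GA", "GT", "AG", "TG"):
        print(top, round(dimer_energy(top), 2))
```

Because s, t and the symmetrised h are symmetric matrices, a dimer and its partner read on the opposite strand (for example GA and TC) receive the same energy, as the symmetric dimer approximation requires.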

5 Results: WC pairs in RNA/DNA hybrids 

We now modify the approach in order to analyse thermodynamic data for RNA/DNA hybrids,
in particular the 68 sequences used by Sugimoto and co-workers [4], and compare our results to
previous NN analyses [5], [12].



ΔG       NN^a             BP^b
GG/CC    -1.77 ± 0.06     -1.76
GC/CG    -2.28 ± 0.08     -2.28
CG/GC    -2.09 ± 0.07     -2.09
AA/TT    -1.02 ± 0.04     -1.01
AT/TA    -0.73 ± 0.05     -0.77
TA/AT    -0.60 ± 0.05     -0.63
GA/CT    -1.46 ± 0.05     -1.52
GT/CA    -1.43 ± 0.05     -1.37
AG/TC    -1.16 ± 0.07     -1.13
TG/AC    -1.38 ± 0.06     -1.36

Table 2: Comparison of NN symmetric dimer parameters with those derived from the
rank-deficient parameters (25). All units are kcal mol$^{-1}$. ^a Dimer parameters from 12-parameter
model obtained by SantaLucia et al. [3]. ^b BP denotes the base pairing model with rank deficient
parameter set JB2b.



The major difference from DNA or RNA homoduplexes is that, owing to the different backbones of
the strands, the zippering direction is readily distinguished; there are no local dimer or global
duplex symmetries, nor a self-symmetric entropy term. Indeed a naive fit of the symmetric
dimer fitting function (22) yields an rms value of 0.57 kcal mol$^{-1}$, considerably poorer than the
homoduplex models.

The dimer propagation matrix is therefore given by (3) where constraints (5) do not hold
and the number of independent matrix elements is simply $(2p)^2$. The comparison of dimer and
global correlations now yields

$$H_1 = h_1, \qquad H_n = h_2,$$
$$H \equiv H_2 = h_1 + h_2 = H_3 = \ldots = H_{n-1}, \qquad (26)$$
$$S_{i,i+1} = s, \qquad T_{i+1,i} = t.$$

Hence the fitting function is now given by

$$\Delta G(X, Y) = \Delta G_{init} + y_1^T h_1 x_1 + y_n^T h_2 x_n + \sum_{i=2}^{n-1} y_i^T (h_1 + h_2) x_i + \sum_{i=1}^{n-1} \left( y_i^T s\, x_{i+1} + y_{i+1}^T t\, x_i \right), \qquad (27)$$

where, in contrast to the single initiation term of Ref. [1], we shall assume distinct terminal
parameters for the termini

$$r(G)/d(C) = r(C)/d(G), \qquad r(A)/d(T) = r(U)/d(A).$$

Of course in general four such parameters may be distinguished, however initially we shall
consider an initiation energy form

$$\Delta G_{init} = a_1 n_{GC} + a_2 n_{AT/U},$$

where $n_{GC}$ and $n_{AT/U}$ count the terminal pairs of each type.
The resulting model has, naively, 2 + 16 parameters to be compared with 1 + 16 for the NN
model [4]. However RNA/DNA hybrids, like homoduplexes, exhibit high complementarity in
H-bonding. Since the "weight vectors" (20) have been shown to capture the essential
sequence-dependent interactions of both DNA/DNA and RNA/RNA [10] helix formation and, noting
that the RNA/DNA hybrid also has a regular helical geometry, it is reasonable to investigate
whether they are successful in the latter instance.

With the existence of complementarity transformations (14), two of each of the $h_1$ and $h_2$
matrix entries appear in single linear combinations. For the DNA/DNA model we obtained
a rank-deficient model by neglecting the distinction between the H-bond correlations of the
two termini. The analogy in the hybrid case is to suppose $h_1 \approx h_2$, the validity of which we
check below. The rank-deficient parameter set for hybrids therefore has 2 + 11 parameters, an
improvement of 4 over the NN model.

In fact we find that this model reproduces the empirical $\Delta G$ data (rms 0.38 kcal mol$^{-1}$)
slightly better than the NN models and in addition gives better reduced $\chi^2/f$ and Q
values, see Table 3. The fitted values obtained (in kcal mol$^{-1}$) are

$$a_1 = 1.65, \qquad a_2 = 1.52,$$

$$h = \begin{pmatrix} -0.151 & 0.277 \\ 0.369 & 0.151 \end{pmatrix}, \quad s = \begin{pmatrix} -0.035 & -0.065 \\ 0.002 & -0.011 \end{pmatrix}, \quad t = \begin{pmatrix} 0.010 & 0.060 \\ 0.078 & 0.035 \end{pmatrix}. \qquad (28)$$

Sugimoto and coworkers used a single helix initiation parameter of 3.1 kcal mol$^{-1}$ which is
consistent with our initiation term which takes the values 3.05, 3.17 or 3.30 kcal mol$^{-1}$ depending
on whether the duplex has respectively 0, 1 or 2 G.C terminals.
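The quoted initiation values follow from the two fitted constants if the initiation term is read as a sum over the two terminal pairs, one constant per G.C terminal and the other per A.T/U terminal. That reading, and the function name, are assumptions of the short check below.

```python
A_GC = 1.65  # fitted a_1, kcal/mol, per terminal G.C pair
A_AT = 1.52  # fitted a_2, kcal/mol, per terminal A.T/U pair


def hybrid_initiation(n_gc_terminals):
    """Initiation energy for a hybrid duplex with 0, 1 or 2 G.C terminals."""
    if n_gc_terminals not in (0, 1, 2):
        raise ValueError("a duplex has exactly two terminal pairs")
    return A_GC * n_gc_terminals + A_AT * (2 - n_gc_terminals)


if __name__ == "__main__":
    for n in (0, 1, 2):
        print(n, round(hybrid_initiation(n), 2))
```

With the rounded constants the zero-G.C case evaluates to 3.04 rather than the quoted 3.05, a difference attributable to rounding of the fitted values.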

Several other observations may be made about the matrix elements (28). Firstly the
difference $h_{12} - h_{21}$ is roughly four times larger in the hybrid case, therefore the model already
incorporates significant distinctions between 5'A/3'T and 5'T/3'A (and similarly for GC)
termini. Doubling the number of initiation terms therefore leads to only a small increase in
accuracy for the cost of two extra parameters.

More surprising is the result that $s_{11} \approx -t_{22}$ and $s_{22} \approx -t_{11}$ would provide excellent
approximations, reducing the parameter number further to 11. In the absence of an obvious theoretical
reason for this coincidence of stackings we cannot reject the possibility that it is an artifact of the
data. The rank deficient parameter set, denoted H1 in Table 3, for RNA/DNA hybrids
therefore contains 13 parameters. Finally we consider the 16 parameter rank-determinant ($h_1 \neq h_2$)
model, denoted H2 in Table 3. As anticipated it is found to provide a marginal improvement in
$\sigma$ (of < 0.005 kcal mol$^{-1}$) but at a cost reflected in the reduced $\chi^2$ and Q statistics. Again for
completeness, in the appendix we tabulate fitted parameters for $\Delta H$ and $\Delta S$.



ΔG      σ^a     χ²/f    Q
NN^b    0.45    1.57    0.01
H1      0.38    1.24    0.12
H2      0.38    1.31    0.07

Table 3: Comparison of statistics for models of hybrid duplex formation. ^a Units are kcal mol$^{-1}$.
^b NN data was calculated using model parameters (NN1) of Sugimoto et al. H1 and H2
denote rank deficient and rank determinant base-pairing parameter sets respectively.



6 Conclusion 



In this paper we have refined the idea, first presented in Ref. [10], that a description of duplex
formation in terms of (fictitious) two-body correlations is equivalent to the more commonly
known NN dimer method.

For all three cases (RNA, DNA and RNA/DNA hybrid helices) fits to empirical data confirm
that this is indeed the case. Moreover the connection between rank-deficient polymer and
rank-determinant oligomer NN parameter sets, known for DNA (see, e.g., Ref. [25]), is transparent
via our approach and enables similar approximations for RNA [10] and RNA/DNA hybrids to
be made.

In the instances of RNA and DNA homoduplexes, while the former approximation does
not offer an increase in model accuracy, statistics of model significance (e.g. $\chi^2/f$, Q) suggest
the base-pairing approach to be more favourable on the grounds that it uses fewer and more
"fundamental" model degrees of freedom. For RNA/DNA hybrids an improvement in both
accuracy and parameter numbers is also observed over the NN method; indeed the high $\chi^2/f$ and
low Q probability for the latter in Table 3 strongly suggest a good deal of redundancy in using
the dimer energy parameters.

The base-pairing model also suggests the interesting possibility of a kind of universality
for stabilising interactions in Watson-Crick paired helices. Specifically the same numerical
weights (20) are found to encapsulate the sequence-dependence for all three types of helices, the
differences between fitted correlation matrix elements being attributed to the effects of different
helical geometries (A-type for RNA/RNA and RNA/DNA, B for DNA/DNA).

It should be emphasised that the base-pairing model is only a description of the
sequence-dependence of helical oligonucleotides in terms of two-body correlations labelled by "weight
vectors". However the suggestion of a connection between this picture and an effective
interaction potential in three spatial dimensions is appealing. The nature of these "weight vectors" is
not yet clear, and further insight may be gained from extending the method to mismatch pairing
geometries.